Distributed Monitoring and Management of Exascale Systems in the Argo Project

Authors

  • Swann Perarnau
  • Rajeev Thakur
  • Kamil Iskra
  • Kenneth Raffenetti
  • Franck Cappello
  • Rinku Gupta
  • Peter H. Beckman
  • Marc Snir
  • Henry Hoffmann
  • Martin Schulz
  • Barry Rountree
Abstract

New computing technologies are expected to change the high-performance computing landscape dramatically. Future exascale systems will comprise hundreds of thousands of compute nodes linked by complex networks. These resources need to be actively monitored and controlled, at a scale that is difficult to manage from a central point as in previous systems. In this context, we describe ongoing work in the Argo exascale software stack project to develop a distributed collection of services that work together to track scientific applications across nodes, control the power budget of the system, and respond to potential failures. Our solution leverages the idea of enclaves: a hierarchy of logical partitions of the system, representing groups of nodes sharing a common configuration. Enclaves are created by the system to encapsulate user jobs, and can also be created by users inside their own jobs. These enclaves provide a second (and greater) level of control over portions of the system, can be tuned to manage specific scenarios, and have dedicated resources to do so.
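
The enclave hierarchy described in the abstract can be pictured as a tree of node groups, each with its own share of the power budget. The following is a minimal illustrative sketch in Python, not the Argo implementation: the class name `Enclave`, its fields, the method names, and the rules that sibling enclaves use disjoint nodes and that children cannot exceed the parent's power budget are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass(eq=False)
class Enclave:
    """Hypothetical model of an enclave: a logical partition of nodes."""
    name: str
    nodes: Set[str]
    power_budget_watts: float
    # repr=False avoids infinite recursion when printing parent/child cycles.
    parent: Optional["Enclave"] = field(default=None, repr=False)
    children: List["Enclave"] = field(default_factory=list)

    def unallocated_power(self) -> float:
        """Power not yet handed down to child enclaves."""
        return self.power_budget_watts - sum(
            c.power_budget_watts for c in self.children
        )

    def free_nodes(self) -> Set[str]:
        """Nodes of this enclave not assigned to any child enclave."""
        used: Set[str] = set()
        for child in self.children:
            used |= child.nodes
        return self.nodes - used

    def create_child(
        self, name: str, nodes: Set[str], power_budget_watts: float
    ) -> "Enclave":
        """Create a nested enclave (assumed rule: siblings do not share nodes)."""
        if not nodes <= self.free_nodes():
            raise ValueError("nodes must be free nodes of the parent enclave")
        if power_budget_watts > self.unallocated_power():
            raise ValueError("power budget exceeds what the parent has left")
        child = Enclave(name, set(nodes), power_budget_watts, parent=self)
        self.children.append(child)
        return child

    def path(self) -> str:
        """Names from the root enclave down to this one, joined by '/'."""
        if self.parent is None:
            return self.name
        return self.parent.path() + "/" + self.name


if __name__ == "__main__":
    # The whole system is the root enclave; a job gets its own enclave,
    # and the user creates a smaller enclave inside that job.
    system = Enclave("system", {"n0", "n1", "n2", "n3"}, 1000.0)
    job = system.create_child("job-1", {"n0", "n1"}, 400.0)
    analysis = job.create_child("analysis", {"n0"}, 150.0)
    print(analysis.path())
    print(system.unallocated_power(), job.free_nodes())
```

In this sketch each enclave only checks its own children, which mirrors the idea of managing the system hierarchically rather than from a single central point.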




Publication date: 2015